4. Searching

4.1 Searching Lists

Algorithm: Selection

Finding the median of a collection of numbers is the selection problem with $k=(n+1)/2$ if $n$ is odd (and, if $n$ is even, the median is the mean of the $k$th and $(k+1)$th smallest items, where $k=n/2$).

Initial Insight

Choose a value from $S$, to be used as pivotValue. Then divide the list into two partitions, leftPart (containing the list items that are smaller than pivotValue) and rightPart (containing the list items that are greater than pivotValue). If the $k$th smallest item has been found, stop. Otherwise, select the partition that must contain the $k$th smallest item, and do the whole thing again with this partition.

Specification

Name: Selection
Inputs: A sequence of integers $S = (s_1, s_2, s_3, ..., s_n)$
An integer $k$
Outputs: An integer $x$
Preconditions: Length of $S>0$ and $k>0$ and $k\le n$
Postcondition: $x$ is the $k$th smallest item in $S$

Code


In [22]:
def quickSelect(k, aList):

    if len(aList) == 1: 
        return aList[0]     # Base case
    
    pivotValue = aList[0]
    leftPart = []
    rightPart = []
    
    for item in aList[1:]:
        if item < pivotValue: 
            leftPart.append(item)
        else: 
            rightPart.append(item)
    
    if len(leftPart) >= k: 
        return quickSelect(k, leftPart)
    elif len(leftPart) == k - 1: 
        return pivotValue
    else: 
        return quickSelect(k - len(leftPart) -1, rightPart)    



print("Median:", quickSelect(6, [2, 36, 5, 21, 8, 13, 11, 20, 4, 1]))


Median: 11

In [23]:
def quickSelect(k, aList):

    if len(aList) == 1: return aList[0]
    
    pivotValue = aList[0]
    leftPart = [x for x in aList[1:] if x < pivotValue]
    rightPart = [x for x in aList[1:] if not x < pivotValue]

    if len(leftPart) >= k: return quickSelect(k, leftPart)
    elif len(leftPart) == k - 1: return pivotValue
    else: return quickSelect(k - len(leftPart) -1, rightPart)    



print("Median:", quickSelect(6, [2, 36, 5, 21, 8, 13, 11, 20, 4, 1]))


Median: 11

Remarks

The crucial step (cf. Quick Sort) that determines whether we have best case or worst case performance is the choice of the pivot – if we are really lucky we will get a value that cuts down the list the algorithm needs to search very substantially at each step.

The algorithm is divide-and-conquer and each iteration makes the sub-problem substantially smaller. In Quick Sort, both partitions are sorted recursively and, provided that the pivot at each stage divides the list up into equal parts, we achieve $O(n \log n)$ complexity.

However, in the Selection algorithm we know which partition to search, so we only deal with one of them on each recursive call, and as a result it is even more efficient. Hence it can be shown that its average-case complexity is $O(n)$.
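
The definition of the median at the start of this section can be wrapped directly around quickSelect. The sketch below is our own addition (the helper name median and the even-length averaging simply follow that definition):

In [ ]:
def median(aList):
    n = len(aList)
    if n % 2 == 1:
        # Odd length: the k-th smallest item with k = (n + 1) / 2
        return quickSelect((n + 1) // 2, aList)
    # Even length: the mean of the k-th and (k + 1)-th smallest items, k = n / 2
    k = n // 2
    return (quickSelect(k, aList) + quickSelect(k + 1, aList)) / 2

# For the ten-item list used above this is the mean of the 5th and 6th smallest,
# i.e. (8 + 11) / 2, so we expect: Median: 9.5
print("Median:", median([2, 36, 5, 21, 8, 13, 11, 20, 4, 1]))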

4.2 Searching for patterns

It often happens that we need to search through a string of characters to find an occurrence (if there is one) of a given pattern, e.g. genetics and DNA searches, keyword searches.

Algorithm: StringMatch

We represent the sequence to be searched simply as a string of characters, referred to as the search string $S$. A shorter sequence, the target string $T$, is what we are looking for: we want the position of the first occurrence of $T$ in $S$, if it is present.

Initial Insight

Repeatedly shift $T$ one place along $S$ and then compare the characters of $T$ with those of $S$. Do this until a match of $T$ in $S$ is found, or the end of $S$ is reached.

Specification

Name: StringMatch
Inputs: A search string $S = (s_1, s_2, s_3, ..., s_n)$
A target string $T = (t_1, t_2, t_3, ..., t_m)$
Outputs: An integer $x$
Preconditions: $m\le n$, $m>0$ and $n>0$
Postcondition: If there is an occurrence of $T$ in $S$, $x$ is the start position of the first occurrence of $T$ in $S$; otherwise $x = -1$

Code


In [85]:
def basicStringSearch(searchString, target):

    searchIndex = 0
    
    lenT = len(target)    
    lenS = len(searchString)    
    
    while searchIndex + lenT <= lenS:

        targetIndex = 0

        while targetIndex < lenT and target[targetIndex] == searchString[targetIndex + searchIndex]:
            targetIndex += 1

        if targetIndex == lenT:
            return searchIndex

        searchIndex += 1

    return -1

In [86]:
# Test Code
for target, index in [('per', 0), ('lta', 14), ('ad', 10), ('astra', -1)]:
    print(basicStringSearch('per ardua ad alta', target)==index)


True
True
True
True

Remarks

It becomes immediately apparent when implementing this algorithm that it consists of two nested loops, giving complexity $O(mn)$ which, since $n \ge m$, is at least $O(m^2)$.

We know that if the character in $S$ immediately beyond the current alignment of $T$ does not occur in $T$, then there is no point sliding $T$ along just one place for the next comparison: we can slide it straight past that character. This gives us the basis for an improved algorithm.

Initial Insight

For each character in $T$ calculate the number of positions to shift $T$ if a comparison fails, according to where (if at all) that character appears in $T$.

Repeatedly compare the characters of $T$ with those of $S$. If a comparison fails, examine the next character along in $S$ and shift $T$ by the calculated shift distance for that character.

Do this until an occurrence of $T$ in $S$ is found, or the end of $S$ is reached.

Remarks

An important point to note first of all is that the part of the algorithm calculating the shifts depends entirely on an analysis of the target string $T$ – there is no need to examine the search string $S$ at all because for any character in $S$ that is not in $T$, the shift is a fixed distance.

This lookup table is called a shift table and it stores a shift distance for each character that could occur in $S$ – e.g. each letter of the alphabet, or, say, all upper- and lower-case letters plus punctuation.

The shift distance is calculated according to the following rules:

  1. If the character does not appear in $T$, the shift distance is one more than the length of $T$.
  2. If the character does appear in $T$, the shift distance is the first position at which it appears, counting from right to left and starting at 1. (Hence, when a character appears more than once in $T$, the lowest such position is kept.)

Suppose $S = $'GGGGGAGGCGGCGGT'. Then for target string $T = $'TCCACC', we have:

Character:  G  A  C  T
Shift:      7  3  1  6

and if $T = $'TGGCG', we have:

Character:  G  A  C  T
Shift:      1  6  2  5


Once the shift table has been computed, the search part of the quick search algorithm is similar to the basic string search algorithm, except that after each failed attempt we look at the character in $S$ immediately beyond the current alignment of $T$ and use the shift table to look up how many places to slide $T$.

We implement the shift table as a dictionary in Python:

Code


In [87]:
def buildShiftTable(target, alphabet):

    shiftTable = {}

    for character in alphabet:
        shiftTable[character] = len(target) + 1

    for i in range(len(target)):
        char = target[i]
        shift = len(target) - i
        shiftTable[char] = shift

    return shiftTable
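
As a quick check, the helper reproduces the two example shift tables given above (a usage sketch of our own, assuming the four-character alphabet {'G', 'A', 'C', 'T'}):

In [ ]:
print(buildShiftTable('TCCACC', {'G', 'A', 'C', 'T'}))   # expect shifts G: 7, A: 3, C: 1, T: 6
print(buildShiftTable('TGGCG', {'G', 'A', 'C', 'T'}))    # expect shifts G: 1, A: 6, C: 2, T: 5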

In [88]:
def quickSearch(searchString, target, alphabet):

    shiftTable = buildShiftTable(target, alphabet)
    searchIndex = 0

    while searchIndex + len(target) <= len(searchString):
 
        targetIndex = 0

        # Compares the strings    
        while targetIndex < len(target) and target[targetIndex] == searchString[searchIndex + targetIndex]:
            targetIndex = targetIndex + 1

        # Return index if target found
        if targetIndex == len(target): return searchIndex

        # Continue the search with a new shift value, or exit
        if searchIndex + len(target) < len(searchString):
            nextChar = searchString[searchIndex + len(target)]
            shift = shiftTable[nextChar]
            searchIndex = searchIndex + shift
        else:
            return -1

    return -1

Tests


In [89]:
theAlphabet = {'G', 'A', 'C', 'T'}
stringToSearch = 'ATGAATACCCACCTTACAGAAACCTGGGAAAAGGCAATAAATATTATAAAAGGTGAACTTACAGAAGTAA'

for thetarget in ['ACAG', 'AAGTAA', 'CCCC']:
    print(quickSearch(stringToSearch, thetarget, theAlphabet))


15
64
-1

Remarks

The basic brute-force algorithm we wrote first will work fine with relatively short search strings but, as with all algorithms, inputs of huge size may overwhelm it. For example, DNA strings can be billions of bases long, so algorithmic efficiency can be vital. We noted already that the complexity of the basic string search can be as bad as O(nm) in the worst case.

As for the quick search algorithm, research has shown that its average-case performance is good but, unfortunately, its worst case behaviour is still O(mn).

Knuth–Morris–Pratt (KMP)

Better algorithms have been developed. One of the best-known efficient search algorithms is the Knuth–Morris–Pratt (KMP) algorithm. A full description of the precise details of the KMP algorithm is beyond the scope of this text.

Algorithm: Knuth–Morris–Pratt (KMP)

The KMP algorithm is in two parts:

  1. Build a table of the lengths of prefix matches up to every character in the target string, $T$.
  2. Move along the search string, $S$, using the information in the table to do the shifting and compare.

Once the prefix table has been built, the actual search in the second step proceeds like the other string-searching algorithms above, but when a mismatch is detected the algorithm uses the prefix table to decide how far to shift $T$. The difficulty is knowing whether such prefix matches exist and, if they do, how long the matching substrings are.

The prefix will then be aligned as shown in Figure 4.17, and comparison can continue at the next character in $S$.

If you want to take the trouble, you can verify that the final table will be:


In [100]:
prefixTable = [0, 1, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 1, 2]

Code


In [92]:
# Helper function for kmpSearch()

def buildPrefixTable(target):    

    #The first line of code just builds a list that has len(target)
    #items all of which are given the default value 0

    prefixTable = [0] * len(target)
    q = 0

    for p in range(1, len(target)):

        while q > 0 and target[q] != target[p]:
            q = prefixTable[q - 1]

        if target[q] == target[p]:
            q = q + 1
        
        prefixTable[p] = q

    return prefixTable
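
For example, for the target 'AAGTAA' (one of the targets used in the tests below), the helper gives the following table (a usage sketch of our own):

In [ ]:
print(buildPrefixTable('AAGTAA'))   # expect [0, 1, 0, 0, 1, 2]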

In [93]:
def kmpSearch(searchString, target):

    n = len(searchString)
    m = len(target)
    prefixTable = buildPrefixTable(target)
    q = 0

    for i in range(n):

        while q > 0 and target[q] != searchString[i]:
            q = prefixTable[q - 1]

        if target[q] == searchString[i]:
            q = q + 1
    
        if q == m:
            return i - m + 1

    return -1

Tests


In [94]:
stringToSearch = 'ATGAATACCCACCTTACAGAAACCTGGGAAAAGGCAATAAATATTATAAAAGGTGAACTTACAGAAGTAA'

for thetarget in ['ACAG', 'AAGTAA', 'CCCC']:
    print(kmpSearch(stringToSearch, thetarget))


15
64
-1

Remarks

What about the complexity of the KMP algorithm? Computing the prefix table takes significant effort but in fact there is an efficient algorithm for doing it. Overall, the KMP algorithm has complexity $O(m + n)$. Since $n$ is usually enormously larger than $m$ (think of searching a DNA string of billions of bases), $m$ is usually dominated by $n$, so this means that KMP has effective complexity $O(n)$.

Other Algorithms

String search is an immensely important application in modern computing, and at least 30 efficient algorithms have been developed for the task. Many of these depend on the principle embodied in the quick search and KMP algorithms – shifting the target string an appropriate distance along the search string at each step, based on information in a table. The Boyer–Moore algorithm, for example, combines elements of both of these algorithms and is widely used in practical applications.

There are also string-search algorithms that work in entirely different ways from the examples we have looked at. Generally, these are beyond the scope of this text, but some are based on hashing functions, which we discuss next.

4.3 Hashing and Hash Tables

Hashing

We have seen how we are able to make improvements in search algorithms by taking advantage of information about where items are stored in the collection with respect to one another. For example, by knowing that a list was ordered, we could search in logarithmic time using a binary search. In this section we will attempt to go one step further by building a data structure that can be searched in $O(1)$ time. This concept is referred to as hashing.

In order to do this, we will need to know even more about where the items might be when we go to look for them in the collection. If every item is where it should be, then the search can use a single comparison to discover the presence of an item.

A hash table is a collection of items which are stored in such a way as to make it easy to find them later. Each position of the hash table, often called a slot, can hold an item and is named by an integer value starting at 0.

Below is a hash table of size $m=11$, implemented in Python as a list with its empty slots initialized to a default value of None.
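
A minimal sketch (the variable name hashTable is our own choice):

In [ ]:
hashTable = [None] * 11   # eleven empty slots, named 0 to 10
hashTable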

The mapping between an item and the slot where that item belongs in the hash table is called the hash function. The hash function will take any item in the collection and return an integer in the range of slot names, between $0$ and $m-1$.

Our first hash function, sometimes referred to as the remainder method, simply takes an item and divides it by the table size, returning the remainder as its hash value:


In [156]:
set_of_integers = [54, 26, 93, 17, 77, 31]
hash_function = lambda x: [y % 11 for y in x]
hash_vals = hash_function(set_of_integers)
hash_vals


Out[156]:
[10, 4, 5, 6, 0, 9]

Once the hash values have been computed, we can insert each item into the hash table at the designated position:
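
Continuing from the cell above (a sketch; it reuses set_of_integers and hash_vals from that cell):

In [ ]:
hashTable = [None] * 11
for item, slot in zip(set_of_integers, hash_vals):
    hashTable[slot] = item        # place each item in its designated slot

hashTable   # expect [77, None, None, None, 26, 93, 17, None, None, 31, 54]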

Now when we want to search for an item, we simply use the hash function to compute the slot name for the item and then check the hash table to see if it is present. This searching operation is $O(1)$, since a constant amount of time is required to compute the hash value and then index the hash table at that location. If everything is where it should be, we have found a constant time search algorithm.
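
A lookup can then be a single hash-and-compare (a sketch; the helper name contains is ours and it assumes, for now, that no collisions have occurred):

In [ ]:
def contains(hashTable, item):
    # One hash computation and one comparison: O(1)
    return hashTable[item % len(hashTable)] == item

print(contains(hashTable, 93), contains(hashTable, 20))   # expect True False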

It immediately becomes apparent that this technique is going to work only if each item maps to a unique location in the hash table. When two or more items need to occupy the same slot, we have what is referred to as a collision (it may also be called a “clash”). Clearly, collisions create a problem for the hashing technique. We will discuss them in detail later.

Hash Functions

Given a collection of items, a hash function that maps each item into a unique slot is referred to as a perfect hash function.

If we know the items and the collection will never change, then it is possible to construct a perfect hash function (refer to the exercises for more about perfect hash functions). Unfortunately, given an arbitrary collection of items, there is no systematic way to construct a perfect hash function. Luckily, we do not need the hash function to be perfect to still gain performance efficiency.

One way to always have a perfect hash function is to increase the size of the hash table so that each possible value in the item range can be accommodated. This guarantees that each item will have a unique slot. Although this is practical for small numbers of items, it is not feasible when the number of possible items is large. For example, if the items were nine-digit Social Security numbers, this method would require almost one billion slots. If we only want to store data for a class of 25 students, we will be wasting an enormous amount of memory.

Our goal is to create a hash function that minimizes the number of collisions, is easy to compute, and evenly distributes the items in the hash table. There are a number of common ways to extend the simple remainder method. We will consider a few of them here.

The folding method for constructing hash functions begins by dividing the item into equal-size pieces (the last piece may not be of equal size). These pieces are then added together to give the resulting hash value.

For example, if our item was the phone number $436-555-4601$, we would take the digits, divide them into groups of $2$, and sum them; that is, $43+65+55+46+01=210$. If we assume our hash table has $11$ slots, then we need to perform the extra step of dividing by $11$ and keeping the remainder. In this case $210 \bmod 11 = 1$, so the phone number $436-555-4601$ hashes to slot $1$. (Some folding methods go one step further and reverse every other piece before the addition. For the above example, we get $43+56+55+64+01=219$, which gives $219 \bmod 11 = 10$.)


In [176]:
word = 4365554601
word = str(word)
step = 2
slots = 11
folds = [int(word[n: n+step]) for n in range(0, len(word), step)]

print(folds)
print(sum(folds))
print(sum(folds)%slots)


[43, 65, 55, 46, 1]
210
1

Another numerical technique for constructing a hash function is called the mid-square method. We first square the item, and then extract some portion of the resulting digits. For example, if the item were $44$, we would first compute $44^2 = 1936$. By extracting the middle two digits, $93$, and performing the remainder step, we get a remainder of $5$ on division by $11$.


In [144]:
set_of_integers = [54, 26, 93, 17, 77, 31]
# [1:-1] keeps the middle digits by dropping the first and last digit of the square
hash_function = lambda x: [int(str(y**2)[1:-1]) % 11 for y in x]
hash_vals = hash_function(set_of_integers)
hash_vals


Out[144]:
[3, 7, 9, 8, 4, 6]

We can also create hash functions for character-based items such as strings. The word “cat” can be thought of as a sequence of ordinal values. We can sum these ordinal (Unicode) values and then take the remainder on division by $11$:


In [145]:
word = 'cat'
sum([ord(l) for l in word]) % 11


Out[145]:
4

To avoid collisions between anagrams (which would otherwise hash to the same value), we could weight each character by its position:


In [146]:
sum([(ord(word[x]) * (x + 1)) for x in range(len(word))]) % 11


Out[146]:
3

You may be able to think of a number of additional ways to compute hash values for items in a collection. The important thing to remember is that the hash function has to be efficient so that it does not become the dominant part of the storage and search process. If the hash function is too complex, then it becomes more work to compute the slot name than it would be to simply do a basic sequential or binary search as described earlier. This would quickly defeat the purpose of hashing.

Collision Resolution

If the hash function is perfect, collisions never occur. However, since this is often not possible, when two items hash to the same slot we must have a systematic method for placing the second item in the hash table. This process is called collision resolution.

One method for resolving collisions looks into the hash table and tries to find another open slot to hold the item that caused the collision. A simple way to do this is to start at the original hash value position and then move in a sequential manner through the slots until we encounter the first slot that is empty.

Note that we may need to go back to the first slot (circularly) to cover the entire hash table. This collision resolution process is referred to as open addressing in that it tries to find the next open slot or address in the hash table. By systematically visiting each slot one at a time, we are performing an open addressing technique called linear probing. Using the hash values from the remainder method example, suppose we now add $44$ and $55$:
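
Below is a sketch of insertion with linear probing (the function name insertLinearProbing is ours; it assumes the remainder hash used above and a table that is not full). Item $20$, which the clustering discussion below refers to, is added as well:

In [ ]:
def insertLinearProbing(hashTable, item):
    size = len(hashTable)
    slot = item % size                     # the original hash value
    while hashTable[slot] is not None:     # collision: probe the next slot,
        slot = (slot + 1) % size           # wrapping round circularly
    hashTable[slot] = item

hashTable = [None] * 11
for item in [54, 26, 93, 17, 77, 31, 44, 55, 20]:
    insertLinearProbing(hashTable, item)

hashTable   # expect [77, 44, 55, 20, 26, 93, 17, None, None, 31, 54]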

Once we have built a hash table using open addressing and linear probing, it is essential that we use the same method when searching for items: to find $44$ or $55$ we must probe sequentially from the original hash position.
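
The matching search probes in the same way (a sketch; searchLinearProbing is our own name and it uses the hashTable built in the previous cell):

In [ ]:
def searchLinearProbing(hashTable, item):
    size = len(hashTable)
    start = item % size                     # the original hash value
    slot = start
    while hashTable[slot] is not None:      # an empty slot means the item is absent
        if hashTable[slot] == item:
            return slot                     # found it
        slot = (slot + 1) % size
        if slot == start:                   # wrapped all the way round: table is full
            break
    return -1

print(searchLinearProbing(hashTable, 44), searchLinearProbing(hashTable, 20))   # expect 1 3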

So, a disadvantage to linear probing is the tendency for clustering; items become clustered in the table. This means that if many collisions occur at the same hash value, a number of surrounding slots will be filled by the linear probing resolution. This will have an impact on other items that are being inserted, as we saw when we tried to add the item 20 above. A cluster of values hashing to 0 had to be skipped to finally find an open position.

One way to deal with clustering is to extend the linear probing technique so that instead of looking sequentially for the next open slot, we skip slots, thereby more evenly distributing the items that have caused collisions. This will potentially reduce the clustering that occurs, e.g. with a “plus 3” probe. This means that once a collision occurs, we will look at every third slot until we find one that is empty.

The general name for this process of looking for another slot after a collision is rehashing. With simple linear probing, in general, $\mathit{rehash}(pos) = (pos + skip) \bmod \mathit{sizeoftable}$. It is important to note that the size of the “skip” must be such that all the slots in the table will eventually be visited. Otherwise, part of the table will be unused. To ensure this, it is often suggested that the table size be a prime number. This is the reason we have been using $11$ in our examples.

A variation of the linear probing idea is called quadratic probing. Instead of using a constant “skip” value, we use a rehash function that increments the hash value by 1, 3, 5, 7, 9, and so on. This means that if the first hash value is $h$, the successive values are $h+1$, $h+4$, $h+9$, $h+16$, and so on. In other words, quadratic probing uses a skip consisting of successive perfect squares:
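
A sketch of the two rehashing variants described above (the function names are our own):

In [ ]:
def rehashLinear(pos, tableSize, skip=1):
    # Linear probing with a fixed skip (a "plus 3" probe would use skip=3)
    return (pos + skip) % tableSize

def quadraticProbes(h, tableSize, numProbes):
    # Successive probe positions h + 1, h + 4, h + 9, ... (perfect squares)
    return [(h + i * i) % tableSize for i in range(1, numProbes + 1)]

print(rehashLinear(10, 11))         # expect 0 (wraps round to the first slot)
print(quadraticProbes(0, 11, 4))    # expect [1, 4, 9, 5]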

An alternative method for handling the collision problem is to allow each slot to hold a reference to a collection (or chain) of items. Chaining allows many items to exist at the same location in the hash table. When collisions happen, the item is still placed in the proper slot of the hash table. As more and more items hash to the same location, the difficulty of searching for the item in the collection increases:

When we want to search for an item, we use the hash function to generate the slot where it should reside. Since each slot holds a collection, we use a searching technique to decide whether the item is present. The advantage is that on the average there are likely to be many fewer items in each slot, so the search is perhaps more efficient.
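
A sketch of chaining, with each slot holding a Python list of the items that hash to it (the function names are ours; the remainder hash is assumed):

In [ ]:
def insertChaining(table, item):
    table[item % len(table)].append(item)      # append the item to its slot's chain

def searchChaining(table, item):
    return item in table[item % len(table)]    # sequential search within one chain

chainTable = [[] for _ in range(11)]
for item in [54, 26, 93, 17, 77, 31, 44, 55, 20]:
    insertChaining(chainTable, item)

print(chainTable[0])      # expect [77, 44, 55] (all three hash to slot 0)
print(searchChaining(chainTable, 55), searchChaining(chainTable, 42))   # expect True False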


In [155]:
set_of_integers = [123456, 431941, 789012, 60375]
print(set_of_integers)
set_of_integers = [((int(str(x)[0:2]) + int(str(x)[2:4]) + int(str(x)[4:])) % 80) -1 for x in set_of_integers]
print(set_of_integers)


[123456, 431941, 789012, 60375]
[21, 22, 19, 21]
